Data Engineering
15 items in Data Engineering
Projects
Data Engineering Garden — Knowledge Base
A public digital garden of data engineering notes, concepts, and guides — built with Quartz v4 and published as a static site.
Lakehouse Platform
Featured: A self-service data lakehouse built on Databricks and Delta Lake, unifying batch and streaming workloads with a single storage layer.
Blog Posts
Databricks Series, Part 6: ML Serving and Workflows
Batch and real-time model inference, Databricks Model Serving endpoints, and orchestrating the full ML pipeline with Databricks Workflows.
Databricks Series, Part 5: Machine Learning with MLflow
Tracking experiments, logging models and artifacts, comparing runs, and managing the model lifecycle with MLflow on Databricks.
Databricks Series, Part 4: Feature Engineering at Scale
Databricks Feature Store, FeatureEngineeringClient, FeatureLookup, training sets, and eliminating training-serving skew.
Databricks Series, Part 3: Data Ingestion with Auto Loader
Auto Loader's cloudFiles format, schema inference, schema evolution, and building robust incremental ingestion pipelines on Databricks.
Databricks Series, Part 2: Lakehouse Architecture
Unity Catalog for governance and discovery, the medallion Bronze/Silver/Gold pattern, and Delta tables as the storage foundation.
Databricks Series, Part 1: Getting Started
Navigating the Databricks workspace, launching clusters, writing notebooks, and submitting your first PySpark job.
Databricks Series, Part 0: Overview
The lakehouse platform concept, what Databricks adds on top of Spark and Delta Lake, and how it compares to alternatives.
Spark Series, Part 4: Performance Tuning
Making Spark jobs fast — partitioning, shuffles, skew, caching, and the most common bottlenecks in production.
Spark Series, Part 3: Structured Streaming
Real-time data processing with Spark Structured Streaming — micro-batches, triggers, watermarks, and output modes.
Spark Series, Part 2: DataFrames and Spark SQL
The practical Spark API — working with structured data using DataFrames, schemas, and SQL queries.
Spark Series, Part 1: RDDs and the Execution Model
Understanding Resilient Distributed Datasets — the foundation of Spark's execution model, transformations, actions, and lazy evaluation.
Spark Series, Part 0: Overview
A high-level introduction to Apache Spark — what it is, why it exists, and where it fits in the modern data stack.
Designing a Data Platform That Doesn't Rot
Lessons from building internal data platforms: what makes them last, what kills them, and the principles I try to apply.